This lab offers an introduction to some of the ideas in text analysis. The code here is not necessarily indicative of best practice, but it is hopefully helpful as a springboard for basic concepts.
We will primarily be using Ken Benoit's quanteda package for text manipulation. There are several other fantastic options, including corpus, tidytext, and tm. stm, which we will use for topic modeling, also has some pre-processing tools. quanteda has a great vignette to help you get started (here). Ken also has great exercises available from his course.
Most of the software packages are written by academics. Reliable and easy-to-use software is really difficult to make. If you use these packages in your published work: please cite them. In R you can even see how the author would like to be cited (and get a bibtex entry).
citation("quanteda")
## Benoit K (2017). _quanteda: Quantitative Analysis of Textual
## Data_. doi: 10.5281/zenodo.1004683 (URL:
## http://doi.org/10.5281/zenodo.1004683), R package version 0.99.22,
## <URL: http://quanteda.io>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {quanteda: Quantitative Analysis of Textual Data},
## author = {Kenneth Benoit},
## year = {2017},
## doi = {10.5281/zenodo.1004683},
## url = {http://quanteda.io},
## note = {R package version 0.99.22},
## }
In the folder I’ve included a number of alternative datasets that you can work with. quanteda, stm and the mnir package also have a number of additional text sources. Even better- use your own!
We start with data downloaded from JSTOR’s Data For Research (DFR) database. I’ve pulled the abstracts for all articles published in the American Journal of Sociology (AJS) and American Sociological Review (ASR) for the years 1975-2013. DFR limits data pulls to 1000 documents at a time, so there are five tab-separated files we’ll be working with. These files are in a .zip file at the following link
First, load some of the general packages.
library(tidyverse)
library(quanteda)
For each of the JSTOR data pulls, there is a citations.tsv file that has the tab-separated citation information of each document, including the text of the article's abstract.
Unfortunately, the .tsv format contains a trailing tab on each data row (so annoying with tab-delimited data). We need to read in the data and the column names separately, or R gets too clever and we end up with mismatched names and data. To make this easier, we provide the following function to load in the data. There is often some annoying step like this to get your data loaded into R.
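To see why this matters, here is a minimal sketch of the problem on a tiny in-memory example (toy data, not the actual JSTOR files): the header row has three names, but each data row ends in a tab and so has four fields, which triggers read.delim()'s row-name heuristic.

```r
# Toy reproduction of the trailing-tab problem: 3 header fields, 4 data fields.
raw <- "a\tb\tc\n1\t2\t3\t\n4\t5\t6\t\n"

# read.delim() sees one more data field than header field, assumes the first
# column holds row names, and shifts everything over:
bad <- read.delim(text = raw)
rownames(bad)          # "1" "4" -- data mistaken for row names

# Reading names and data separately, as getJournalCites() does, avoids this:
col.names <- unlist(read.delim(text = raw, header = FALSE,
                               stringsAsFactors = FALSE, nrow = 1))
dat <- read.delim(text = raw, header = FALSE, stringsAsFactors = FALSE, skip = 1)
dat <- dat[, -4]       # drop the empty trailing column
names(dat) <- col.names
```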
# load ASR data
getJournalCites <- function(file.loc) {
  # read the column names and the data separately
  col.names <- unlist(read.delim(file.loc, header=F, stringsAsFactors = F, nrow=1))
  cites <- read.delim(file.loc, header=F, stringsAsFactors = F, skip=1)
  # remove the trailing column, and any rows without abstracts
  cites <- select(cites, -14)
  names(cites) <- col.names
  cites <- filter(cites, nchar(abstract)>0)
  cites$year <- format(as.Date(cites$pubdate), "%Y")
  return(cites)
}
# load each in turn
ajs.cites <- getJournalCites("data/ajs_1997_2013/citations.tsv")
ajs.cites.2 <- getJournalCites("data/ajs_75_96/citations.tsv")
asr.cites <- getJournalCites("data/asr_1997_2013/citations.tsv")
asr.cites.2 <- getJournalCites("data/asr_85_96/citations.tsv")
asr.cites.3 <- getJournalCites("data/asr_75_84/citations.tsv")
# package into a single data frame
cites <- rbind(ajs.cites, ajs.cites.2, asr.cites, asr.cites.2, asr.cites.3)
cites$doc_id <- paste(cites$author, cites$year) #let's make a variable to identify it
Take a look at the data frame you just produced. For small to medium sized document collections (let’s say less than a million documents), this “spreadsheet” format is often the easiest way to work with data.
The quanteda package contains some great tools for easily processing a set of documents into the document-term matrix numerical representation we'll use. First, we construct a corpus object from the loaded data. We'll use the "doc_id" field we just created for docnames instead of the "title" field because titles are not 100% unique in these data. The docvars option lets us bring in other document metadata we'd like to have available with the corpus later.
corpus <- corpus(cites,
                 docid_field="doc_id",
                 text_field="abstract")
summary(corpus,3)
## Corpus consisting of 3402 documents, showing 3 documents:
##
## Text Types Tokens Sentences
## Paul Ingram, Jeffrey Robinson, Marc L. Busch 2005 97 154 5
## Michael Hechter 2004 73 139 6
## Don Tomaskovic‐Devey, Sheryl Skaggs 2002 112 175 8
## id doi
## 10.1086/497350 10.1086/497350
## 10.1086/421357 10.1086/421357
## 10.1086/344214 10.1086/344214
## title
## The Intergovernmental Network of World Trade: IGO Connectedness, Governance, and Embeddedness
## From Class to Culture
## Sex Segregation, Labor Process Organization, and Gender Earnings Inequality
## author
## Paul Ingram, Jeffrey Robinson, Marc L. Busch
## Michael Hechter
## Don Tomaskovic‐Devey, Sheryl Skaggs
## journaltitle volume issue pubdate
## American Journal of Sociology 111 3 2005-11-01T00:00:00Z
## American Journal of Sociology 110 2 2004-09-01T00:00:00Z
## American Journal of Sociology 108 1 2002-07-01T00:00:00Z
## pagerange publisher type reviewed-work year
## pp. 824-858 The University of Chicago Press fla NA 2005
## pp. 400-445 The University of Chicago Press fla NA 2004
## pp. 102-128 The University of Chicago Press fla NA 2002
##
## Source: /Users/brandonstewart/Dropbox/Slides/MichiganWorkshop2017/labs/* on x86_64 by brandonstewart
## Created: Thu Nov 30 21:44:27 2017
## Notes:
Next, we transform the corpus into a document feature matrix (also called a document-term matrix, document-token matrix, or document-word matrix) where each row is a document (article) and each column is a distinct term. As we'll see below, this object can be used in many places as a document-term matrix. The dfm() function has a verbose= option that defaults to TRUE; setting it to FALSE will mask the following output.
mydfm <- dfm(corpus, remove=stopwords("english"), remove_punct=TRUE)
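For intuition about what dfm() builds, here is the same idea in miniature using only base R on two toy sentences (a sketch only; quanteda additionally handles stop words, punctuation, and sparse storage):

```r
docs <- c(d1 = "social networks shape social mobility",
          d2 = "gender inequality in labor markets")
tokens <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(tokens)))
# one row per document, one column per term, counts in the cells
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
dtm["d1", "social"]   # 2 -- "social" occurs twice in d1
```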
We can also remove rare terms. The advantage of this trimming is that each word is a dimension in the vector representation of each document. Thus, we remove a lot of dimensions from our data without losing very much information. Setting verbose=TRUE provides us with some useful information about the number of features that were removed.
mydfm <- dfm_trim(mydfm, min_count = 5, min_docfreq = 3, verbose = TRUE)
## Removing features occurring:
## - fewer than 5 times: 14,248
## - in fewer than 3 documents: 12,707
## Total features removed: 14,429 (69.9%).
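The trimming rule itself is simple enough to sketch in base R on a toy count matrix (hypothetical counts, just to show the logic):

```r
m <- rbind(c(5, 1, 0),
           c(3, 0, 2),
           c(4, 1, 1))
colnames(m) <- c("social", "rare", "sparse")
# keep terms occurring at least 5 times overall AND in at least 3 documents
keep <- colSums(m) >= 5 & colSums(m > 0) >= 3
trimmed <- m[, keep, drop = FALSE]
colnames(trimmed)   # only "social" survives both thresholds
```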
Quanteda makes it easy to get quick overviews of your data. The first simple summary is the most frequently occurring words. This is often a good first check that something hasn't gone horribly wrong. For example, when first preparing this, I forgot to remove punctuation and many pieces of punctuation were the most frequent terms.
topfeatures(mydfm, 10)
## social data effects theory analysis model economic
## 2961 1686 1340 1216 1169 1069 1010
## political study research
## 990 982 976
A fun way of summarizing your text is with a word cloud package. quanteda can do this for us as well. Under the hood, quanteda is calling the wordcloud R package to do the magic.
set.seed(48109)
textplot_wordcloud(mydfm, min.freq = 6, random.order = FALSE,
                   rot.per = .25, max.words=50,
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))
A tremendously useful technique is Key Words In Context (KWIC). This lets you check that your guess about how a word is being used is actually right.
kwic(corpus, "nuance*", window=3)
##
## [Nicholas Pedriana, Robin Stryker 2004, 156] authors contribute to |
## [Rick Grannis 2010, 48] . Similarly, |
## [Nicola Beisel, Tamara Kay 2004, 67] develops a more |
## [Iddo Tavory, Ann Swidler 2009, 182] for a more |
## [John Iceland, Kyle Anne Nelson 2008, 56] obtain a more |
## [Marc Dixon, Andrew W. Martin 2012, 184] suggest a more |
## [Timothy J. Owens 1994, 154] are employed, |
##
## nuanced | explanations of state
## nuances | in how we
## nuanced | approach to intersectionality
## nuanced | analysis of culture
## nuanced | view of spatial
## nuanced | picture of union
## nuances | are revealed that
kwic(corpus, phrase("causal inference"), window=3)
##
## [James Mahoney 1999, 13:14]
## [Geoffrey T. Wodtke, David J. Harding, Felix Elwert 2011, 109:110]
##
## single strategy of | causal inference | . In fact
## novel methods of | causal inference | for time-varying treatments
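kwic() is doing something simple enough that we can sketch a toy base-R version (single words only, no phrase support):

```r
# toy KWIC: show up to `window` words on either side of each matching token
kwicBase <- function(text, pattern, window = 3) {
  toks <- strsplit(text, "\\s+")[[1]]
  vapply(grep(pattern, toks), function(i) {
    pre  <- tail(toks[seq_len(i - 1)], window)   # up to `window` words before
    post <- head(toks[-seq_len(i)], window)      # up to `window` words after
    paste(paste(pre, collapse = " "), "|", toks[i], "|",
          paste(post, collapse = " "))
  }, character(1))
}
kwicBase("we need a far more nuanced view of spatial patterns", "^nuanc")
# "a far more | nuanced | view of spatial"
```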
Often in the social sciences we care about contrasts in language between one group and another. The following code sums over the relevant grouping variable.
group_dfm <- dfm(corpus, remove=stopwords("english"), remove_punct=TRUE,
                 groups="journaltitle")
group_dfm[,1:10]
## Document-feature matrix of: 2 documents, 10 features (0% sparse).
## 2 x 10 sparse Matrix of class "dfm"
## features
## docs membership certain intergovernmental
## American Journal of Sociology 40 47 4
## American Sociological Review 78 59 3
## features
## docs organizations igos world trade
## American Journal of Sociology 201 11 102 35
## American Sociological Review 269 5 195 35
## features
## docs organization long argued
## American Journal of Sociology 146 39 52
## American Sociological Review 210 65 67
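Under the hood, grouping a document-feature matrix is just summing rows within each group. In base R, the equivalent operation on a toy count matrix is:

```r
m <- rbind(c(1, 2),
           c(3, 4),
           c(5, 6))
colnames(m) <- c("social", "data")
journal <- c("AJS", "ASR", "AJS")
# one row per journal, with the word counts summed within each group
g <- rowsum(m, group = journal)
g
```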
By adding comparison=TRUE we can now plot a word cloud that compares the groups in our dataset.
set.seed(48109)
textplot_wordcloud(group_dfm, max.words=50,
                   colors = RColorBrewer::brewer.pal(8,"Dark2"),
                   comparison=TRUE)
We are going to analyze this corpus using the \(k\)-means algorithm. \(k\)-means is the most popular (and arguably important) flat clustering algorithm. Its objective is to minimize the average squared Euclidean distance of documents from their cluster centers where a cluster center is defined as the mean or centroid \(\overrightarrow{\mu}\) of the documents in a cluster \(\omega: \overrightarrow{\mu}(\omega) = \frac{1}{|\omega|}\sum_{\overrightarrow{x} \in \omega} \overrightarrow{x}\) (Manning, Raghavan and Schutze, 2008, 360).
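Both pieces of that definition are easy to verify with base R's kmeans() on toy two-dimensional data (a sketch, not our document matrix): the returned centers are exactly the cluster means, and the objective value is the summed squared distance of points to their assigned centers.

```r
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))
fit <- kmeans(x, centers = 2, nstart = 5)

# each center is the mean (centroid) of its cluster's points
mu <- colMeans(x[fit$cluster == 1, ])
all.equal(unname(mu), unname(fit$centers[1, ]))   # TRUE

# the objective is the total squared distance from points to their centers
wss <- sum((x - fit$centers[fit$cluster, ])^2)
all.equal(wss, fit$tot.withinss)                  # TRUE
```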
The \(k\)-means algorithm uses a random initialization. In order to get the same results each time we will set the random seed. We will need to rerun this code immediately before executing the model if we want to guarantee the same results.
set.seed(12345)
Obviously we can try other seeds to see if we get different results.
The kmeans() function in R takes three relevant inputs: our data matrix, our k (the number of clusters we are looking for), and the number of starts. The number of starts will initialize the algorithm with however many random starting points the user requests, and then report the results with the lowest residual sum of squares. If your computer is running slowly you can decrease the number of starts.
For the purposes of this exercise, to make this run faster, we are going to aggressively trim down the size of the document-feature matrix.
mydfm <- dfm_trim(mydfm, min_count = 50, min_docfreq = 25, verbose = TRUE)
## Removing features occurring:
## - fewer than 50 times: 4,987
## - in fewer than 25 documents: 4,367
## Total features removed: 4,995 (80.5%).
In addition to generating 10 clusters, let's compare that to 5 clusters. Also, let's see if subsequent runs of kmeans produce similar clusters.
Code runtime interlude: What's k-means doing?
kmeans.results.10 <- kmeans(mydfm, centers = 10, nstart = 3)
The cluster element of these results objects has a cluster number for each of the documents in our corpus. Of course, simply looking at this list gives us very little indication of what’s going on.
head(kmeans.results.10$cluster)
## Paul Ingram, Jeffrey Robinson, Marc L. Busch 2005
## 9
## Michael Hechter 2004
## 10
## Don Tomaskovic‐Devey, Sheryl Skaggs 2002
## 9
## Merril Silverstein, Vern L. Bengtson 1997
## 9
## Paul M. Hirsch 1997
## 9
## Susan Clampet-Lundquist, Kathryn Edin, Jeffrey R. Kling, Greg J. Duncan 2011
## 9
Cluster labels (or topic names) can be generated automatically, but you should always manually read a sample of documents as well.
The guiding principle behind automatic labeling of clusters is to find the words that best characterize the cluster. Any automatic labeling strikes a balance between finding words that are representative of the cluster and words that distinguish the cluster from others. The words we identify should minimize the surprise when we read the documents after viewing the labels. The variance-weighted log-odds ratio developed by Monroe et al. in "Fighting Words" is a very useful tool for ranking the words within clusters in order to generate labels.
First, we need a function, clusterFightinWords, to calculate the cluster-specific variance-weighted log-odds of each term.
# dfm -- a quanteda document-term object
# clust.vect -- a boolean vector, one entry per document in dfm, where TRUE indicates a member of the focal cluster
# alpha.0 -- the strength of the prior, expressed as number of terms per sub-corpus
clusterFightinWords <- function(dfm, clust.vect, alpha.0=500) {
  # we need the overall corpus word distribution and the cluster-specific word distributions
  # y_{kw} in Monroe et al.
  overall.terms <- colSums(dfm)
  # n and n_k in Monroe et al.
  n <- sum(overall.terms)
  # alpha_{kw} in Monroe et al.
  prior.terms <- overall.terms / n * alpha.0
  # y_{kw}(i) in Monroe et al. (drop=FALSE so one-document clusters also work)
  cluster.terms <- colSums(dfm[clust.vect, , drop=FALSE])
  # n_k(i) in Monroe et al.
  cluster.n <- sum(cluster.terms)
  cluster.term.odds <-
    (cluster.terms + prior.terms) /
    (cluster.n + alpha.0 - cluster.terms - prior.terms)
  overall.term.odds <-
    (overall.terms + prior.terms) /
    (n + alpha.0 - overall.terms - prior.terms)
  # usually, we'd hate to blindly log something, but as long as we have a non-zero prior,
  # these will both be non-zero.
  log.odds <- log(cluster.term.odds) - log(overall.term.odds)
  variance <- 1/(cluster.terms + prior.terms) + 1/(overall.terms + prior.terms)
  # return the variance-weighted log-odds for each term
  return(log.odds / sqrt(variance))
}
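As a quick sanity check, the function behaves sensibly on a tiny toy count matrix (the function is redefined compactly here, with a drop = FALSE guard, so the snippet runs on its own): the word overrepresented in the focal cluster gets the largest variance-weighted log-odds.

```r
clusterFightinWords <- function(dfm, clust.vect, alpha.0 = 500) {
  overall.terms <- colSums(dfm)
  n <- sum(overall.terms)
  prior.terms <- overall.terms / n * alpha.0
  cluster.terms <- colSums(dfm[clust.vect, , drop = FALSE])
  cluster.n <- sum(cluster.terms)
  cluster.term.odds <- (cluster.terms + prior.terms) /
    (cluster.n + alpha.0 - cluster.terms - prior.terms)
  overall.term.odds <- (overall.terms + prior.terms) /
    (n + alpha.0 - overall.terms - prior.terms)
  log.odds <- log(cluster.term.odds) - log(overall.term.odds)
  variance <- 1/(cluster.terms + prior.terms) + 1/(overall.terms + prior.terms)
  log.odds / sqrt(variance)
}

# docs 1-2 talk mostly about mobility, doc 3 mostly about gender
toy <- rbind(c(10, 1, 5),
             c(12, 2, 4),
             c(1, 9, 4))
colnames(toy) <- c("mobility", "gender", "social")
z <- clusterFightinWords(toy, c(TRUE, TRUE, FALSE), alpha.0 = 10)
names(which.max(z))   # "mobility"
```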
We also need a function, clusterLabels, that calls clusterFightinWords for each cluster and packages the results into a tidy set of labels.
# dfm -- a quanteda document-term object
# results -- a k-means results object
# n.terms -- the number of terms to include in the label
clusterLabels <- function(dfm, results, n.terms=10) {
  clusters <- length(results$withinss)
  labels <- rep("", clusters)
  for (clust.num in 1:clusters) {
    clust.vect <- results$cluster==clust.num
    terms <- clusterFightinWords(dfm, clust.vect)
    terms <- order(terms, decreasing = T)[1:n.terms]
    # use the terms as indices on the features list of the dfm
    labels[clust.num] <- paste(colnames(dfm)[terms], collapse=", ")
  }
  return(labels)
}
# In case we want to print tidy versions of the labels.
# I don't use this here, but sprintf() is super useful to know.
printLabels <- function(labels, results){
  for (clust.num in seq_along(labels)) {
    # sprintf() is a little wonky, but a super powerful tool for formatting your text output
    cat(sprintf("%2i) %5.0f Members | %s\n",
                clust.num,
                sum(results$cluster==clust.num),
                labels[clust.num]))
  }
}
Now that we have functions defined to do the heavy lifting, it's easier to compare our different clusterings. Note that the numbering of the clusters might differ even when the clusters themselves are very similar.
First, let's look at our 10-cluster run:
clusterLabels(dfm=mydfm, results=kmeans.results.10, 10)
## [1] "mobility, occupational, intergenerational, class, origin, models, patterns, origins, career, industrial"
## [2] "labor, inequality, economic, market, income, growth, workers, countries, foreign, earnings"
## [3] "women, men, gender, women's, marriage, female, sex, men's, married, gap"
## [4] "political, democracy, protest, party, state, elites, class, democratic, interests, business"
## [5] "school, children, family, effects, educational, parental, parents, age, children's, marital"
## [6] "status, task, characteristics, occupational, characteristic, belief, beliefs, prestige, attainment, interaction"
## [7] "religious, religion, attendance, church, participation, nations, beliefs, involvement, membership, economies"
## [8] "racial, black, blacks, segregation, white, whites, residential, race, composition, areas"
## [9] "organizations, organizational, sociology, theory, can, cultural, historical, law, model, article"
## [10] "social, movement, network, movements, theory, networks, action, capital, ties, exchange"
Do these labels make sense to you? Try it again with different numbers of clusters.
We can also get a high-level view of the cluster content by building a word cloud of just the documents in the cluster. Note that these will use the raw counts of words, rather than pulling out the words that most specifically apply to the cluster of interest. It won't be surprising if words like "social" appear in most of the word clouds for these docs. Instead of showing what words set this cluster apart, like our fightin' words labels provide, the word clouds show how prevalent words are within the cluster, taken in isolation.
textplot_wordcloud(mydfm[kmeans.results.10$cluster==1,], max.words=30)
Once we label something, we have a habit of letting that label stand in for the thing that it labels. It's important, before we get too attached to any labeling of a set of documents, to confirm that what we infer from the label is equivalent to what we infer from the documents.
Let’s grab the cluster labels for our first 10-cluster solution and a few abstracts from each cluster and see how they compare.
labels <- clusterLabels(mydfm, kmeans.results.10, 10)
# add cluster numbers to the corpus documents
corpus$documents$cluster <- kmeans.results.10$cluster
Let’s make a convenience function for displaying a sample of docs from a given cluster.
printCorpusCluster <- function(corpus, clust.num, num.docs, labels) {
  print(labels[clust.num])
  cluster.corpus <- corpus_subset(corpus, cluster==clust.num)
  # sample some docs from the cluster sub-corpus
  cluster.docs <- texts(cluster.corpus)[sample(x=1:nrow(cluster.corpus$documents), num.docs)]
  return(cluster.docs)
}
And look at how these compare to some of the labels:
printCorpusCluster(corpus, 1, num.docs=3, labels)
## [1] "mobility, occupational, intergenerational, class, origin, models, patterns, origins, career, industrial"
## Joel M. Podolny, James N. Baron 1997
## "We examine how the structure and content of individuals' networks in the workplace affect intraorganizational mobility. Consistent with prior research, we find that an individual's mobility is enhanced by having a large, sparse network of informal ties for acquiring information and resources. However, in contrast to previous work, we emphasize the importance of consistent role expectations for performance and mobility. We find evidence that well-defined performance expectations are more likely to arise from a small, dense network of individuals. We develop a typology of network contents and document the interaction between network structure and content in analyses of mobility among employees of a high-technology firm. We also examine how the effects of tie duration on mobility vary by tie content. We discuss the implications of our results for theory and research on networks and organizational mobility."
## C. Matthew Snipp 1985
## "This paper reviews several neo-Weberian ideas about social class for understanding patterns of men's career mobility. Special attention is devoted to the non-Marxist concepts of social class advocated by Giddens, Sorokin, and Parkin. Further insights are provided by an analysis of data for 20,000 men in the experienced American civilian labor force. The data are fit with a loglinear model and then further analyzed with multidimensional scaling techniques. The data analysis discloses three broad categories of social class that consist of farmers, manual, and nonmanual occupations. The findings also reveal further subdivisions in which professionals and skilled craft occupations form distinct classes within the nonmanual and manual categories. A third, much smaller, subdivision exists between semi-skilled operatives and unskilled laborers. These findings support neo-Weberian social class concepts. These results are also compared with Breiger's (1981) research on intergenerational mobility. This comparison shows that patterns in career mobility reveal a simpler picture of class structure than the eight class categories that Breiger reports for intergenerational mobility. The conceptual implications of these findings are discussed."
## Peter S. Bearman, Glenn Deane 1992
## "Models that allow the decomposition of mobility into its structural and exchange components are used to identify the structure, and consequences, of middle-class intergenerational mobility in preindustrial (1548-1689) Norwich, England. Dramatic shifts in the mobility opportunities of sons over time are seen to yield distinct political outcomes. Political stability is associated with almost universal upward mobility in the period from 1548 to 1589, while from 1590 to 1639 structural processes leading to massive downward mobility are associated with increased radicalization and participation in the English civil war. From 1640 to 1689, strata persistence in an unstable political context prevailed."
We might want to see how the two journals compare in terms of the clusters of documents.
corpus$documents$cluster <- kmeans.results.10$cluster
labels <- clusterLabels(mydfm, kmeans.results.10, 3)
# package the labels into a lookup dataframe so it can be merged into the corpus and used in the graph
label.df <- data.frame(cluster=1:length(labels), cluster.name=labels)
corpus$documents %>%
  left_join(label.df) %>%
  ggplot() +
  geom_bar(aes(x=cluster.name, fill=journaltitle), position="dodge", width=0.75) +
  xlab("Cluster") +
  ylab("Number of Articles") +
  # the labels are loooonnnngggg, so rotate them 45 degrees
  theme(axis.text.x=element_text(angle=45, hjust=1))
## Joining, by = "cluster"
This section introduces the Structural Topic Model (stm) package. Most of it is drawn directly from the package’s Vignette, which goes through these and other features in even greater depth. I’ll also show how to fit a topic model using standard LDA through the Mallet package’s Gibbs sampling algorithm.
We’re going to be working with Eisenstein and Xing’s (2010) corpus of political blog posts from 2008. This will give us two covariates of interest: the liberal/conservative rating and the day during 2008 the blog was published.
We start by loading stm and the dataset. The textProcessor() function will use the tm package’s text processing tools to stem and remove stop words in the documents. It will return the documents divided into the documents, vocab, and meta formats that we saw working with the poliblog5k sample. Check ?textProcessor for options, as it allows you to change stemming options, alter the language used for stop word identification, or even pass in your own custom stop word list.
library(stm)
library(data.table)
poliblog <- as.data.frame(data.table::fread("poliblogs2008.csv"))
# default: removes stop words in english, stems words, removes words under 3 characters, puts everything to lower case
processed.docs <- textProcessor(poliblog$documents, metadata = poliblog)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
str(processed.docs, list.len=10)
## List of 4
## $ documents :List of 13246
## ..$ 1 : int [1:2, 1:127] 10333 1 11536 1 11726 1 13018 1 13228 1 ...
## ..$ 2 : int [1:2, 1:189] 11860 1 12037 1 13139 1 13280 1 13768 1 ...
## ..$ 3 : int [1:2, 1:121] 10364 2 11139 1 12814 1 13228 1 13280 2 ...
## ..$ 4 : int [1:2, 1:135] 9815 1 10121 1 10234 1 10585 3 11703 1 ...
## ..$ 5 : int [1:2, 1:113] 10168 1 10513 1 11149 1 12500 1 13254 1 ...
## ..$ 6 : int [1:2, 1:203] 397 1 10513 1 10675 1 11248 1 11373 1 ...
## ..$ 7 : int [1:2, 1:172] 11536 1 11860 1 13114 1 13177 1 13768 2 ...
## ..$ 8 : int [1:2, 1:236] 187 1 10008 1 10207 1 11312 1 11821 1 ...
## ..$ 9 : int [1:2, 1:159] 3184 1 10364 1 10932 1 11749 2 12946 2 ...
## ..$ 10 : int [1:2, 1:176] 10802 1 11821 1 13095 1 13228 1 13567 4 ...
## .. [list output truncated]
## $ vocab : chr [1:123990] "---" "---backyard" "---bootstrap" "---box" ...
## $ meta :'data.frame': 13246 obs. of 6 variables:
## ..$ V1 : chr [1:13246] "1" "2" "3" "4" ...
## ..$ documents: chr [1:13246] "After a week of false statements, lies, and dismissive apologies, Pakistani President Pervez Musharraf now says"| __truncated__ "I honestly don't know how either party's caucus results will play out tonight. Usually, you can get a good idea"| __truncated__ "While we stand in awe of the willingness of our troops in Iraq to sacrifice themselves for the Nation, the fewe"| __truncated__ "These pages recently said goodbye to global warming. Ironically, the current spell of global warming, such as "| __truncated__ ...
## ..$ docname : chr [1:13246] "at0800300_1.text" "at0800300_2.text" "at0800300_3.text" "at0800300_4.text" ...
## ..$ rating : chr [1:13246] "Conservative" "Conservative" "Conservative" "Conservative" ...
## ..$ day : int [1:13246] 3 3 3 3 3 3 4 4 4 4 ...
## ..$ blog : chr [1:13246] "at" "at" "at" "at" ...
## $ docs.removed: int(0)
## - attr(*, "class")= chr "textProcessor"
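The cleaning steps logged above amount to something like the following base-R sketch (no stemming, and only a tiny toy stop list; the real function stems with SnowballC and uses a full stop-word list):

```r
txt <- "After a week of False statements, 3 lies!!"
tok <- tolower(txt)                              # lower case
tok <- gsub("[[:punct:][:digit:]]", " ", tok)    # drop punctuation and numbers
tok <- strsplit(trimws(tok), "\\s+")[[1]]        # tokenize on whitespace
tok <- tok[nchar(tok) >= 3]                      # drop very short words
tok <- tok[!tok %in% c("after", "the", "of")]    # tiny toy stop list
tok
```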
As we saw above, we often want to remove the rarer words from our corpus so the algorithm can focus on words that occur in many different documents. The prepDocuments() function will do that for us. The plotRemoved() function will let us preview how many documents, words, and tokens we'll remove by setting the threshold number of documents to various levels.
# let's look at 0 to 1% of the documents (130)
plotRemoved(processed.docs$documents, lower.thresh = seq(1, 130, by = 5))
It looks like the majority of the words this trims out are removed at around a threshold of 20 documents. We’ll go with that.
prepDocuments will handle reindexing the vocabulary and the metadata as words are removed. This is a royal pain to do by hand–shoulders of giants here.
prepped.docs <- prepDocuments(processed.docs$documents,
                              processed.docs$vocab,
                              processed.docs$meta,
                              lower.thresh = 20)
## Removing 115708 of 123990 terms (246536 of 2298953 tokens) due to frequency
## Your corpus now has 13246 documents, 8282 terms and 2052417 tokens.
Let’s take a quick look at the metadata in our dataset: the blog from which the post comes, and the day the post was made.
# Which blogs are which?
table(prepped.docs$meta$blog, prepped.docs$meta$rating)
##
## Conservative Liberal
## at 3197 0
## db 0 1879
## ha 3708 0
## mm 677 0
## tp 0 2080
## tpm 0 1705
# Conservatives and Liberals over time
table(prepped.docs$meta$rating, round(prepped.docs$meta$day/30, 0))
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## Conservative 252 418 576 645 737 625 643 662 771 667 638 547 401
## Liberal 187 392 374 288 333 424 477 493 589 693 644 463 307
# Blogs over time
table(prepped.docs$meta$blog, round(prepped.docs$meta$day/30, 0))
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## at 120 208 266 237 294 267 264 262 294 278 283 238 186
## db 78 118 103 110 131 114 149 155 206 235 216 156 108
## ha 98 146 268 359 352 299 319 357 404 333 310 267 196
## mm 34 64 42 49 91 59 60 43 73 56 45 42 19
## tp 76 163 176 84 100 181 196 188 215 224 212 160 105
## tpm 33 111 95 94 102 129 132 150 168 234 216 147 94
The stm function does the really heavy lifting, fitting a topic model to our data. There are more advanced settings for altering the Bayesian prior or the initialization algorithms used, but I'm going to focus on just a handful of features in this lab.
K sets the number of topics. As we’ll discuss in lecture, and as covered in the readings, choosing this is an important part of choosing a model.
Several of the options are worth mentioning/discussing here:
The overall topic prevalence (the expected share of a random document devoted to each possible topic) can be modeled as a function of covariates of the document. Here, we use the R model specification idioms to make it dependent on the rating (partisanship) of the blog author and a spline function of the day, s(day), which will allow the overall topic prevalence to change over the course of 2008.
The content =~ term would allow us to set a single categorical variable over which the word distribution of topics can be allowed to vary. In this case, we might allow each of the six different authors to speak about the same topic using slightly different distributions of words. We're going to skip over this feature for this lab, but you can find out more about it in the various STM papers by Roberts et al. If you do explore it, keep in mind that it can have very large impacts on your model and requires that each category for which you want separate content estimates contain a fairly large number of documents.
max.em.its tells stm the maximum number of iterations to use in the EM algorithm before giving up on it ever converging.
init.type=“Spectral” tells stm to initialize the EM algorithm using the spectral decomposition of the term co-occurrence matrix. The spectral initialization uses a deterministic algorithm to find a good starting point for STM's main EM algorithm. In many of the computational and human-judged ways that we'd want to compare models, it tends to produce better models than a random initialization.
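Before fitting, it may help to see what the prevalence formula expands to. stm's s() builds a spline basis over day, roughly analogous to splines::bs() in base R; here is a hedged sketch using a hypothetical toy version of the metadata:

```r
# toy data frame standing in for the blog metadata (not the real poliblog data)
meta <- data.frame(rating = factor(rep(c("Conservative", "Liberal"), 25)),
                   day = 1:50)
# rating enters as a dummy variable; s(day) is roughly a spline basis like bs()
X <- model.matrix(~ rating + splines::bs(day, df = 5), data = meta)
dim(X)   # 50 rows; 1 intercept + 1 rating dummy + 5 spline columns = 7
```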
poliblog.fit <-
  stm(prepped.docs$documents,
      prepped.docs$vocab,
      K = 20, # number of topics
      prevalence =~ rating + s(day),
      max.em.its = 75,
      data = prepped.docs$meta,
      init.type = "Spectral")
save(poliblog.fit, file="poliblog_fit.Rdata")
load("poliblog_fit.Rdata")
The first step in exploring the topic model result is to see what’s going on in the topics so that we can start to get a sense of what features of the corpus they might reflect.
Calling plot.STM() on the fit object will show us the topics and their relative shares at a glance.
plot.STM(poliblog.fit)
Of course, these three-word labels don’t give us much insight into the model. Let’s dig into two of the topics to see what’s going on.
The labelTopics command will give us the top 7 words (by default) of each topic, using four different metrics for ranking the top words. If we’d included a content covariate, then we’d need to call sageLabels(poliblog.fit) to get the same information; it would average across the content groups to give both overall word lists for each topic and a word list for each content covariate category within each topic.
Prob ranks words by their raw prevalence in the topic; FREX takes the harmonic mean of a word's frequency and its exclusivity to the topic. Score is taken from the lda package, and Lift comes out of the multinomial inverse regression we looked at last week.
Personally, with this model (on this corpus), I find the FREX (frequency/exclusivity) lists to be the most useful. When we look at LDAVis below, this is similar to what the lambda slider does to the word lists.
labelTopics(poliblog.fit)
## Topic 1 Top Words:
## Highest Prob: obama, campaign, barack, mccain, biden, will, senat
## FREX: biden, barack, obama, joe, debat, campaign, team
## Lift: oct, goolsbe, biden, celeb, ifil, wurzelbach, obamabiden
## Score: oct, obama, barack, biden, campaign, mccain, ayer
## Topic 2 Top Words:
## Highest Prob: obama, mccain, poll, voter, democrat, state, lead
## FREX: poll, pennsylvania, virginia, percent, margin, elector, voter
## Lift: composit, zogbi, quinnipiac, gallup, todaygallup, outlier, outspend
## Score: composit, poll, mccain, obama, voter, rasmussen, percent
## Topic 3 Top Words:
## Highest Prob: one, time, get, stori, question, doesnt, media
## FREX: didnt, doesnt, reader, cant, isnt, blogger, blog
## Lift: bookshelf, livechat, registeredw, handychristma, ustream, watchupd, readership
## Score: registeredw, doesnt, didnt, media, exit, drudg, reader
## Topic 4 Top Words:
## Highest Prob: democrat, senat, republican, bill, vote, hous, legisl
## FREX: legisl, pelosi, reid, lieberman, bipartisan, nanci, speaker
## Lift: hoyer, legisl, steni, clotur, reid, pelosi, dcalif
## Score: legisl, republican, senat, vote, pelosi, reid, lieberman
## Topic 5 Top Words:
## Highest Prob: american, america, peopl, polit, will, countri, can
## FREX: wright, black, racist, pastor, reagan, liber, church
## Lift: saul, negro, reverend, perlstein, pfleger, farrakhan, sermon
## Score: saul, wright, church, black, jeremiah, pastor, racial
## Topic 6 Top Words:
## Highest Prob: elect, vote, voter, republican, state, ballot, democrat
## FREX: franken, ballot, coleman, registr, chambliss, minnesota, seat
## Lift: hagan, nrsc, chambliss, elia, gopheld, saxbi, wicker
## Score: elia, franken, coleman, ballot, registr, voter, vote
## Topic 7 Top Words:
## Highest Prob: one, will, film, allah, gay, christian, book
## FREX: film, allah, muhammad, gay, marriag, qur, hage
## Lift: aljalalayn, hartley, ibn, qur, yusuf, filmmak, kathir
## Score: meccan, film, allah, qur, muhammad, sura, ibn
## Topic 8 Top Words:
## Highest Prob: group, million, money, organ, chicago, school, public
## FREX: chicago, donat, school, rezko, student, ayer, donor
## Lift: voucher, daley, annenberg, dohrn, rezko, bernadin, bernardin
## Score: voucher, ayer, chicago, rezko, school, acorn, donat
## Topic 9 Top Words:
## Highest Prob: oil, energi, price, will, drill, gas, product
## FREX: drill, oil, energi, mugab, china, gas, price
## Lift: refineri, tsvangirai, mugab, thabo, zimbabw, mdc, opec
## Score: mbeki, oil, mugab, drill, energi, price, gas
## Topic 10 Top Words:
## Highest Prob: iran, israel, nuclear, terrorist, state, iranian, terror
## FREX: hama, palestinian, israel, isra, iran, iranian, nuclear
## Lift: damascus, fatah, iaea, lebanes, assad, bashar, hama
## Score: fatah, iran, israel, hama, iranian, isra, palestinian
## Topic 11 Top Words:
## Highest Prob: global, warm, abort, chang, studi, research, climat
## FREX: warm, abort, climat, scientist, scienc, scientif, birth
## Lift: dioxid, embryon, ipcc, alarmist, scientif, anthropogen, scientist
## Score: embryon, abort, global, warm, climat, emiss, scienc
## Topic 12 Top Words:
## Highest Prob: iraq, war, iraqi, militari, troop, will, forc
## FREX: iraqi, iraq, maliki, petraeus, sadr, troop, baghdad
## Lift: almaliki, alsadr, basra, centcom, drawdown, fallon, nouri
## Score: fallon, iraq, iraqi, troop, maliki, sadr, petraeus
## Topic 13 Top Words:
## Highest Prob: clinton, hillari, will, democrat, primari, campaign, parti
## FREX: hillari, clinton, romney, huckabe, deleg, primari, edward
## Lift: superdel, bylaw, superdeleg, wolfson, rodham, frontrunn, huckabe
## Score: bylaw, hillari, clinton, romney, deleg, huckabe, superdeleg
## Topic 14 Top Words:
## Highest Prob: law, court, investig, tortur, case, state, depart
## FREX: tortur, blagojevich, interrog, detaine, attorney, court, justic
## Lift: habea, antonin, mukasey, siegelman, addington, yoo, guantánamo
## Score: addington, blagojevich, attorney, tortur, detaine, court, interrog
## Topic 15 Top Words:
## Highest Prob: govern, will, attack, russia, pakistan, georgia, russian
## FREX: russian, musharraf, pakistan, pakistani, taliban, russia, nato
## Lift: benazir, bhutto, isi, islamabad, ossetia, putin, tbilisi
## Score: gori, pakistan, russian, russia, taliban, musharraf, pakistani
## Topic 16 Top Words:
## Highest Prob: think, like, know, say, peopl, just, thing
## FREX: linktocommentspostcount, postcounttb, matthew, guy, digbi, oreilli, think
## Lift: somerbi, tina, digbyim, tweeti, gasbag, dooci, macho
## Score: tina, linktocommentspostcount, postcounttb, matthew, oreilli, think, digbi
## Topic 17 Top Words:
## Highest Prob: mccain, palin, john, said, sen, campaign, sarah
## FREX: palin, sarah, raz, mccain, fox, alaska, sen
## Lift: holtzeakin, pfotenhau, whiner, scriptalreadyrequest, acflruncont, allowfullscreentru, allowscriptaccessalway
## Score: whiner, mccain, palin, raz, sarah, sen, alaska
## Topic 18 Top Words:
## Highest Prob: said, citi, polic, report, immigr, offic, kill
## FREX: polic, immigr, taser, incid, flight, san, protest
## Lift: taser, berkeley, deport, capt, vandal, passeng, escort
## Score: taser, polic, immigr, citi, murder, spitzer, arrest
## Topic 19 Top Words:
## Highest Prob: tax, will, economi, econom, govern, plan, american
## FREX: tax, mortgag, financi, fanni, billion, economi, loan
## Lift: homeownership, mortgageback, lender, efca, fdic, gses, mae
## Score: efca, tax, mortgag, bailout, billion, economi, loan
## Topic 20 Top Words:
## Highest Prob: bush, presid, administr, hous, white, said, polici
## FREX: bush, cheney, presid, administr, secretari, georg, white
## Lift: methodist, perino, podesta, mcclellan, cheney, powel, bush
## Score: methodist, bush, presid, administr, cheney, perino, white
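To make the FREX ranking concrete, here is a toy sketch in base R. The numbers are made up and this is not stm's exact implementation (stm works with empirical CDFs rather than raw ranks), but the idea, a weighted harmonic mean of a word's frequency rank and exclusivity rank, is the same:

```r
# Toy FREX sketch (made-up numbers, not stm's exact code).
# beta: P(word | topic) for 3 topics over a 5-word vocabulary
beta <- matrix(c(.40, .30, .20, .05, .05,
                 .10, .10, .20, .30, .30,
                 .20, .20, .20, .20, .20),
               nrow = 3, byrow = TRUE)
colnames(beta) <- c("obama", "campaign", "vote", "poll", "oil")

topic <- 1
freq <- beta[topic, ]                  # prevalence within the topic
excl <- beta[topic, ] / colSums(beta)  # share of each word's mass in this topic

w <- 0.5  # weight on exclusivity
frex <- 1 / (w / rank(excl) + (1 - w) / rank(freq))
names(sort(frex, decreasing = TRUE))   # highest-FREX words first
```

With these made-up probabilities, "obama" ranks first on both frequency and exclusivity, so it tops the FREX list as well.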
STM will also help us call up representative documents to read to make sure our topics mean what we think they mean. The findThoughts() function will pull up the top n documents associated with each topic.
# our readable document text is buried in the $meta object, and stored as a factor
doc.text <- as.character(prepped.docs$meta$documents)
sample.docs <- findThoughts(poliblog.fit,
                            texts = doc.text,
                            n = 3,
                            topics = c(1, 3, 11))
# the first topic we grabbed: 1, Obama
sample.docs$docs[[1]]
## [1] "At a press conference just now, Barack Obama rejected John McCain's demand for a suspension of the debate. \"\"I believe we should continue to have the debate,\"\" he just said. \"\"I believe it makes sense for us to present ourselves to the American people.\"\" \"\"Obviously if it turns out that we need to be in Washington, we've both got big planes, we've painted our slogan on the side of them,\"\" Obama also said. \"\"They can get us from Washington to Mississippi pretty quickly.\"\" The debate is set to take place in Mississippi. Obama also said that he was blindsided by McCain's public call for a debates suspension. After describing their conversation about a possible suspension, Obama said: \"\"I thought that this was something that he was mulling over. Apparently this was something that he was more decisive about in his own mind.\"\" Obama described their conversation as follows: \"\"I proposed putting out the joint statement. He concurred with that. he then also said, 'I would like us to look at suspending the campaign and pushing the debates off.' I said, 'let's put out the joint statement first, and then get our campaigns to discuss this.'\"\" Obama said he later saw McCain announcing his plans on television. One more time: If this version of events is true, McCain's public call for a suspension was anything but apolitical. If McCain had truly intended to keep this apolitical, he would have asked Obama to jointly suspend the debates, made his own full intentions clear, and waited for Obama's private and definitive answer before going public."
## [2] "Hillary is already out with a statement hitting Obama over the Associated Press story reporting that a Canadian official said in a memo that Obama economic adviser Antan Goolsbee privately suggested that Obama's NAFTA stump talk was \"\"more about political positioning than a clear articulation of policy plans.\"\" Here's her statement... I think that after days of denial, the Obama campaign was confronted with a memo of a meeting -- it was my understanding -- in which there was a discussion of NAFTA. And it raises questions about Senator Obama coming to Ohio and giving speeches about NAFTA and having his chief economic advisor tell the Canadian government that it was just political rhetoric. I donâ\u0080\u0099t think people should come to Ohio and tell the people of Ohio one thing and then have your campaign tell a foreign government something else behind closed doors. Goolsbee has disputed the memo's characterization of the conversation, saying: \"\"In no possible way was I inferring that he was going to introduce any policies that you should ignore and he had no intention of enacting.\"\" It's unclear how this latest turn in the whole saga will play in Ohio, where NAFTA is a highly-charged issue, or whether it's coming too late to move votes there."
## [3] "The McCain campaign hits Obama over Wes Clark's comments for the third day running, releasing yet another statement... \"\"Yesterday, Barack Obama's campaign said he rejected Gen. Clark's attack on John McCain's military service. But last night, Gen. Clark admitted to speaking with the Obama campaign, and then went out and repeated his attacks. It's clear that the Obama campaign isn't telling Wes Clark to apologize, and are either encouraging or tolerating his attacks on John McCain's military service. The Obama campaign even said they were 'glad' that Gen. Clark 'clarified' a comment they supposedly repudiated. If this kind of wink-and-nod game is how Barack Obama wants to run his campaign, then fine. But spare us the empty talk of 'new politics' and raising the dialogue in this country. We just wonder: Will Barack Obama's actions ever match his words?\"\" It's really unclear what exactly the McCain campaign is pretending to be outraged about. Whatever the wisdom of Obama's repudiation of Clark's comments, this is pretty simple: Obama doesn't stand by Clark's comments, and Clark does. Clark communicated as much to the Obama campaign. How does this prove that Obama is \"\"encouraging or tolerating\"\" the attacks? It doesn't, of course. At this point McCain advisers are simply manufacturing outrage, secure in the knowledge that they won't get called out for it. Oh, and by the way, Clark didn't \"\"attack\"\" McCain's military service."
# the second topic: 3, meta-media
sample.docs$docs[[2]]
## [1] "The Northern Alliance Radio Network will be on the air today, with our eight-hour-long broadcast schedule starting at 9 am CT. If you’re in the Twin Cities, you can hear us on AM 1280 The Patriot, or on the station’s Internet stream if you’re outside of the broadcast area.Today, I’m traveling to the Left Coast for the holidays — and to see Notre Dame play USC at the Colisseum — but Mitch will have all the thrills and chills of our usual show. We won’t have a live video stream, but the chat room will be up & operational, as long as Ustream doesn’t flop like it did yesterday.Be sure to call 651-289-4488 to join the conversation!Mitch and I now live-stream our show from 1-3 pm CT on Ustream, complete with live chat! Be sure to register for free at Ustream to participate in our raucous live-chat sessions. (And if the log-in prompt doesn’t come up in the chat box below, use this link instead.)Want to see all episodes of The Ed Morrissey Show? Add the RSS feed to your reader and keep them handy!Christmas season is coming! Why not shop through my Amazon widgets? Get great deals delivered to your door. Check out Ed’s Bookshelf or search for what you want:<!--</p> <p>// --><!--</p> <p>// -->"
## [2] "Today, on the Ed Morrissey Show (3 pm ET), Duane “Generalissimo” Patterson of the Hugh Hewitt Show joins us for our Week in Review. However, I suspect most of our time will be spent on analyzing last night’s debate between Sarah Palin and Joe Biden. We’ll save some time to preview Hugh’s show tonight, after his marathon 6-hour broadcast last night.Special note: Today is my 5-year blogiversary! I started Captain’s Quarters on October 2nd, 2003, and I’ve blogged constantly since. I can’t think of a better way to celebrate it than with my friends at TEMS. Big thanks to Jim Lynch, who first reminded me of it, and to Sister Toldjah, who is also celebrating Blogiversary #5 today.Now you can join the conversation in the chat room! Be sure to register at Ustream to participate in our raucous live-chat sessions. (And if the log-in prompt doesn’t come up in the chat box below, use this link instead.) Jazz Shaw of The Moderate Voice moderates the chat, and he has his Troll Gun Trivia Contest running before the show, for those who get there early — and are registered!Want to see all episodes of The Ed Morrissey Show? Add the RSS feed to your reader and keep them handy!Opening animation from TG Studios."
## [3] "Today, on the Ed Morrissey Show (3 pm ET), we’re thrilled to welcome back Mary Katharine Ham. MK and I will talk about the latest developments in the Blago-Rahma scandal, the Joe Scarborough takedown of the national media, and much more. In the first half of the show, we will talk with Lila Rose, the young woman who has gone undercover to expose the practices of Planned Parenthood in a series of videos. Don’t miss this!!Update: This was one of my favorite shows. Be sure to watch it on replay below!Now you can join the conversation in the chat room! Be sure to register at Ustream to participate in our raucous live-chat sessions. (And if the log-in prompt doesn’t come up in the chat box below, use this link instead.) Jazz Shaw of The Moderate Voice moderates the chat, and he has his Troll Gun Trivia Contest running before the show, for those who get there early — and are registered!Want to see all episodes of The Ed Morrissey Show? Add the RSS feed to your reader and keep them handy!Christmas season is coming! Why not shop through my Amazon widgets? Get great deals delivered to your door. Check out Ed’s Bookshelf or search for what you want: Opening animation from TG Studios."
# and the third topic: 11, global warming
sample.docs$docs[[3]]
## [1] "Is the scientific debate over on global warming? Not according to the American Physical Society* in this year's July's issue of Physics and Society .\"\"With this issue of Physics & Society, we kick off a debate concerning one of the main conclusions of the International Panel on Climate Change (IPCC), the UN body which, together with Al Gore, recently won the Nobel Prize for its work concerning climate change research. There is a considerable presence within the scientific community of people who do not agree with the IPCC conclusion that anthropogenic CO2 emissions are very probably likely to be primarily responsible for the global warming that has occurred since the Industrial Revolution. Since the correctness or fallacy of that conclusion has immense implications for public policy and for the future of the biosphere, we thought it appropriate to present a debate within the pages of P&S concerning that conclusion. This editor invited several people to contribute articles that were either pro or con. Christopher Monckton responded ...\"\" [Emphasis added.]And what did Lord Monckton say?\"\"Some reasons why the IPCC's estimates may be excessive and unsafe are explained. More importantly, the conclusion is that, perhaps, there is no \"\"climate crisis\"\", and that currently-fashionable efforts by governments to reduce anthropogenic CO2 emissions are pointless, may be ill-conceived, and could even be harmful.\"\"He examined specific assumptions of the IPCC cited computer models and found that, even using the same models but with more justifiable assumptions, carbon dioxide is not a critical threat to global temperatures.\"\"Theoretically, empirically, and in the literature that we have extensively cited, each of the values we have chosen as our central estimate is arguably more justifiable - and is certainly no less justifiable - than the substantially higher value selected by the IPCC. 
Accordingly, it is very likely that in response to a doubling of pre-industrial carbon dioxide concentration TS will rise not by the 3.26 °K suggested by the IPCC, but by <1 °K.\"\"He concluded with \"\"If the concluding equation in this analysis is correct, the IPCC's estimates of climate sensitivity must have been very much exaggerated. There may, therefore, be a good reason why, contrary to the projections of the models on which the IPCC relies, temperatures have not risen for a decade and have been falling since the phase-transition in global temperature trends that occurred in late 2001. Perhaps real-world climate sensitivity is very much below the IPCC's estimates. Perhaps, therefore, there is no \"\"climate crisis\"\" at all. At present, then, in policy terms there is no case for doing anything. The correct policy approach to a non-problem is to have the courage to do nothing.\"\" [Emphasis added.]*According to Wikipedia http://en.wikipedia.org/wiki/American_Physical_Society, the American Physical Society was founded in 1899 and is the second largest association of physicists in the world, with over 40,000 members. Update: According to http://www.aps.org/APS Climate Change StatementAPS Position Remains UnchangedThe American Physical Society reaffirms the following position on climate change, adopted by its governing body, the APS Council, on November 18, 2007:\"\"Emissions of greenhouse gases from human activities are changing the atmosphere in ways that affect the Earth's climate.\"\"An article at odds with this statement recently appeared in an online newsletter of the APS Forum on Physics and Society, one of 39 units of APS. The header of this newsletter carries the statement that \"\"Opinions expressed are those of the authors alone and do not necessarily reflect the views of the APS or of the Forum.\"\" This newsletter is not a journal of the APS and it is not peer reviewed. Read: APS Climate Change Statement"
## [2] "Climate change report forecasts global sea levels to rise up to 4 feet by 2100. \t\t\t\t\tAccording to a new report led by the U.S. Geological Survey, the U.S. “faces the possibility of much more rapid climate change by the end of the century than previous studies have suggested.” The report, commissioned by the U.S. Climate Change Science Program, found that global sea levels could rise higher than a 2007 U.N. Intergovernment Panel on Climate Change (IPCC) study had concluded: In one of the report’s most worrisome findings, the agency estimates that in light of recent ice sheet melting, global sea levels could rise as much as 4 feet by 2100. The intergovernment panel had projected a rise of no more than 1.5 feet by that time, but satellite data over the last two years show the world’s major ice sheets are melting much more rapidly than previously thought. The Antarctic and Greenland ice sheets are losing an average of 48 cubic miles of ice a year, equivalent to twice the amount of ice in the Alps. The lead scientist for the report’s chapter on ice sheets said the models used by the IPCC “did not factor in some of the dynamics that scientists now understand about ice sheet melting” such as “a process of ‘lubrication,’ in which warmer ocean water gets underneath coastal ice sheets and accelerates melting.” \t Update The Wonk Room's Brad Johnson has more on ice sheet melting."
## [3] "Deathly news for the religion of Global Warming. Looks like at least one prominent scientific group has changed its mind about the irrefutability of evidence regarding man made climate change.The American Physical Society representing nearly 50,000 physicists \"\"has reversed its stance on climate change and is now proclaiming that many of its members disbelieve in human-induced global warming,\"\" according to an article in The Daily Tech. The leadership of the society had previously referred to global warming evidence as \"\"incontravertible.\"\"What does this mean? It doesn't mean that the APS has abandoned its position on man made climate change. What they have done is perhaps more significant; they have opened the subject up for debate once again. And once the facts can be judged on their merits and not on who is submitting them, at the very least global warming deniers have a powerful platform to present their findings.And the first shot across the bow is directed at the Nobel Prize winning group, the IPPC which one scientist discovered has been fudging its studies to hide the fact that increased C02 does not necessarily lead to increased temps:The APS is opening its debate with the publication of a paper by Lord Monckton of Brenchley, which concludes that climate sensitivity -- the rate of temperature change a given amount of greenhouse gas will cause -- has been grossly overstated by IPCC modeling. 
A low sensitivity implies additional atmospheric CO2 will have little effect on global climate.Larry Gould, Professor of Physics at the University of Hartford and Chairman of the New England Section of the APS, called Monckton's paper an \"\"expose of the IPCC that details numerous exaggerations and \"\"extensive errors\"\"In an email to DailyTech, Monckton says, \"\"I was dismayed to discover that the IPCC's 2001 and 2007 reports did not devote chapters to the central 'climate sensitivity' question, and did not explain in proper, systematic detail the methods by which they evaluated it. When I began to investigate, it seemed that the IPCC was deliberately concealing and obscuring its method.\"\" According to Monckton, there is substantial support for his results, \"\"in the peer-reviewed literature, most articles on climate sensitivity conclude, as I have done, that climate sensitivity must be harmlessly low.\"\"Monkton attributes temperature rise to \"\"natural variability\"\" which has been a key argument of the deniers for years. This is only one scientific organization but an important one. And while such news is unlikely to reverse the political momentum toward draconian changes, it may slow the march of science toward assisting the charlatans like Al Gore in their efforts.If so, it is big news indeed."
As before, we can also use word clouds to get snapshot overviews of our topics of interest.
cloud(poliblog.fit, topic=1)
cloud(poliblog.fit, topic=3)
## Warning in wordcloud::wordcloud(words = vocab, freq = vec, max.words =
## max.words, : doesnt could not be fit on page. It will not be plotted.
cloud(poliblog.fit, topic=17)
It can also be illustrative to directly compare topics of interest in terms of their vocabulary, to make sure that what we think are the distinguishing characteristics of each actually do distinguish them. plot.STM() has a plot type = "perspectives" that does exactly this.
plot.STM(poliblog.fit, type="perspectives", topics=c(1,17))
Those two are pretty uninteresting because of how much the two candidates dominate them. How about topic 20: bush-presid-admin and topic 4: democrat-senate-republican?
plot.STM(poliblog.fit, type="perspectives", topics=c(4,20))
If we had run our model with a content covariate, then calling plot.STM(…, type="perspectives") with a single topic number would have let us see how the different content categories talk about that topic.
Finally, Sievert and Shirley’s (2014) LDAvis package is an excellent tool for taking a high-level tour of your topic model’s output.
Each topic is represented by a circle whose size is determined by the topic's overall prevalence. LDAvis places the topics in two dimensions using multidimensional scaling of the divergence between their distributions over words (by default, Jensen-Shannon divergence). This means that topics close to one another in the display have more similar word distributions than topics that are far apart.
On the right is a list of the top 30 (by default) terms overall. If you click on a topic, the list updates to the top 30 terms in that topic, showing each word's frequency within the topic as well as in the corpus as a whole. By default, the order of these lists is determined by raw word frequencies. In the upper right is a slider for lambda, which lets you trade off between ranking by raw word frequency within a topic (lambda = 1) and adjusting for the overall word frequency (lambda = 0). The authors recommend (based on testing with human coders) a value of 0.6 for most uses.
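The lambda slider implements Sievert and Shirley's "relevance" measure: relevance(w, t) = lambda * log(p(w|t)) + (1 - lambda) * log(p(w|t) / p(w)). A toy base-R sketch (the probabilities are made up, not taken from our model):

```r
# Toy sketch of LDAvis's relevance ranking (made-up probabilities).
p.w.t <- c(obama = .30, will = .20, biden = .10, debat = .05)  # P(word | topic)
p.w   <- c(obama = .05, will = .15, biden = .01, debat = .01)  # P(word) overall

relevance <- function(lambda) {
  lambda * log(p.w.t) + (1 - lambda) * log(p.w.t / p.w)
}

names(sort(relevance(1),   decreasing = TRUE))  # raw in-topic frequency
names(sort(relevance(0.6), decreasing = TRUE))  # authors' recommended lambda
```

At lambda = 1 the common word "will" outranks "biden"; at lambda = 0.6 the exclusivity adjustment moves "biden" ahead of it, much as FREX does in labelTopics.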
Setting out.dir will output the javascript and html necessary to run LDAvis to the directory you specify. You can also leave this argument out, and it will just run as a local server in the RStudio "viewer".
toLDAvis(poliblog.fit,
         docs = prepped.docs$documents,
         out.dir = "poliblog_LDAvis")
estimateEffect and plot.estimateEffect are tools in STM for investigating how our topics relate to covariates of interest.
The first step is to estimate model effects from our fit object. The estimateEffect function takes a formula (with the topics of interest specified as the outcome variable), the model fit to predict from, the metadata, and the type of uncertainty estimate to use. The default uncertainty setting, "Global," incorporates topic-proportion uncertainty into the displayed uncertainty estimates.
We’ll run this using the same formula with which we fit the model so we can recover both the rating and time dimensions.
fit.effects <- estimateEffect(1:20 ~ rating + s(day),
                              poliblog.fit,
                              meta = prepped.docs$meta,
                              uncertainty = "Global")
We could first get an overview of the topics on the liberal-conservative dimension.
plot.estimateEffect(fit.effects,
                    covariate = "rating",
                    topics = 1:20,
                    model = poliblog.fit,
                    method = "difference",
                    cov.value1 = "Liberal", cov.value2 = "Conservative",
                    xlab = "More Conservative ... More Liberal",
                    main = "Effect of Liberal vs. Conservative",
                    xlim = c(-.1, .1),
                    labeltype = "frex",  # use stm's "frex" to rank top words
                    n = 3,               # only list the top 3 words
                    verbose.labels = F,  # labels get spammy with this T
                    width = 35           # allow for longer label strings
                    )
Interesting that the "meta-media" topic, 3, is the most partisan. Perhaps this is because one of the blogs in particular most often writes about television broadcasts?
We can run a new estimateEffect where we look at variation in just this topic by blog name. Note that we can use a different formula in estimateEffect than the one we fit in stm. That lets us look at how topic prevalence varies with a covariate, even if that covariate was not part of our model structure.
blog.effects <- estimateEffect(c(3) ~ blog,
                               poliblog.fit,
                               meta = prepped.docs$meta,
                               uncertainty = "Global")
plot.estimateEffect(blog.effects,
                    covariate = "blog",
                    topics = 3,
                    model = poliblog.fit,
                    method = "pointestimate",
                    verbose.labels = T)
As we get closer to publishing, we probably want to clean up our graph and add custom labels using the labeltype = “custom” feature of plot.estimateEffect.
plot.estimateEffect(fit.effects,
                    covariate = "rating",
                    topics = c(1, 10, 18, 13, 17, 20),
                    model = poliblog.fit,
                    method = "difference",
                    cov.value1 = "Liberal", cov.value2 = "Conservative",
                    xlab = "More Conservative ... More Liberal",
                    main = "Topic Prevalence Liberal vs. Conservative",
                    xlim = c(-.1, .1),
                    labeltype = "custom",
                    custom.labels = c('Obama', 'Israel',
                                      'Immigration',
                                      'Hillary Clinton',
                                      'McCain/Palin',
                                      'Bush'))
We can also plot against a continuous variable (such as the “day” variable in poliblog) using method = “continuous”.
plot.estimateEffect(fit.effects,
                    covariate = "day",
                    method = "continuous",
                    topics = c(1, 17),
                    model = poliblog.fit,
                    printlegend = TRUE,
                    xlab = "Time (2008)")
This looks a lot like the conservative/liberal frequency graphs we saw when using poliblog for supervised learning – although it’s Topic 1 (Obama) that maps to the conservative frequency and Topic 17 (McCain) that looks like the liberal frequency.
To investigate that further, we want to allow prevalence to change not only over time and by liberal/conservative rating, but to let those two categories follow different temporal patterns in prevalence. STM allows us to specify this possibility by setting an interaction in the prevalence formula, the same way we would in a regression model. Running this with "Global" uncertainty can take a while, so here we'll run just a couple of topics of interest.
First, we fit the estimateEffects for the interaction formulation.
interact.effects <- estimateEffect(c(1, 17) ~ rating * s(day),
                                   poliblog.fit,
                                   metadata = prepped.docs$meta,
                                   uncertainty = "Global")
Then we can take a look at Topic 1
plot.estimateEffect(interact.effects,
                    topics = 1,
                    covariate = "day",
                    model = poliblog.fit,
                    method = "continuous",
                    moderator = "rating",
                    moderator.value = "Liberal",
                    linecol = "blue",
                    printlegend = F)
plot.estimateEffect(interact.effects,
                    topics = 1,
                    covariate = "day",
                    model = poliblog.fit,
                    method = "continuous",
                    moderator = "rating",
                    moderator.value = "Conservative",
                    linecol = "red",
                    add = T,
                    printlegend = F)
legend(150, .02, c("Liberal", "Conservative"), lwd = 2, col = c("blue", "red"))
So the Topic 1 difference isn't driving the overall difference in liberal-conservative frequency we observed. How about Topic 17?
plot.estimateEffect(interact.effects,
                    topics = 17,
                    covariate = "day",
                    model = poliblog.fit,
                    method = "continuous",
                    moderator = "rating",
                    moderator.value = "Liberal",
                    linecol = "blue",
                    printlegend = F)
plot.estimateEffect(interact.effects,
                    topics = 17,
                    covariate = "day",
                    model = poliblog.fit,
                    method = "continuous",
                    moderator = "rating",
                    moderator.value = "Conservative",
                    linecol = "red",
                    add = T,
                    printlegend = F)
legend(0, .14, c("Liberal", "Conservative"), lwd = 2, col = c("blue", "red"))
So, Liberal blogs appear to rapidly increase their discussion of McCain in the second half of the year.
In this lab, we'll look at two methods for scaling documents or authors in a unidimensional spatial model. Wordscores and Wordfish both seek to extract, for each text or author, a position on a single latent dimension that we take to be motivating word choice. Wordscores is a supervised approach, requiring us to specify the pair or set of training documents that define the dimension. Wordfish is an unsupervised approach that extracts a latent dimension on its own, though as we'll see, it's not always the dimension we're interested in.
We’ll also take a brief look at correspondence analysis, a method that will allow us to fit a multi-dimensional model to a set of documents.
We'll be using the quanteda package for most of the work herein; it contains implementations of Wordscores, Wordfish, Naive Bayes, and correspondence analysis.
library(quanteda)
We’ll start with the simple, non-parametric Wordscores approach.
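The mechanics are simple enough to sketch in a few lines of base R. This toy example (made-up word counts, not quanteda's implementation) follows the Laver, Benoit, and Garry (2003) recipe: score each word by the expected position of the reference texts that use it, then score a new text by the frequency-weighted average of its word scores:

```r
# Toy Wordscores sketch (made-up counts, not quanteda's implementation).
# Two reference texts with known positions -1 (left) and +1 (right):
ref <- rbind(left  = c(tax = 10, spend = 20, cut = 5),
             right = c(tax = 20, spend = 5,  cut = 10))
ref.scores <- c(left = -1, right = 1)

rel <- ref / rowSums(ref)                   # word rates within each reference text
p.r.w <- sweep(rel, 2, colSums(rel), "/")   # P(reference text | word)
word.scores <- colSums(p.r.w * ref.scores)  # expected position of each word

# score an unseen "virgin" text by the weighted mean of its word scores
virgin <- c(tax = 5, spend = 5, cut = 0)
sum(virgin / sum(virgin) * word.scores)
```

Here "spend" marks the left reference and "tax" and "cut" lean right, so the virgin text, which splits its words between "tax" and "spend", comes out slightly left of center (about -0.13).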
First, let's take a look at U.S. presidential inaugural speeches during the 20th century. The full corpus of speeches is included in the quanteda package.
summary(data_corpus_inaugural)
## Corpus consisting of 58 documents:
##
## Text Types Tokens Sentences Year President FirstName
## 1789-Washington 625 1538 23 1789 Washington George
## 1793-Washington 96 147 4 1793 Washington George
## 1797-Adams 826 2578 37 1797 Adams John
## 1801-Jefferson 717 1927 41 1801 Jefferson Thomas
## 1805-Jefferson 804 2381 45 1805 Jefferson Thomas
## 1809-Madison 535 1263 21 1809 Madison James
## 1813-Madison 541 1302 33 1813 Madison James
## 1817-Monroe 1040 3680 121 1817 Monroe James
## 1821-Monroe 1259 4886 129 1821 Monroe James
## 1825-Adams 1003 3152 74 1825 Adams John Quincy
## 1829-Jackson 517 1210 25 1829 Jackson Andrew
## 1833-Jackson 499 1269 29 1833 Jackson Andrew
## 1837-VanBuren 1315 4165 95 1837 Van Buren Martin
## 1841-Harrison 1896 9144 210 1841 Harrison William Henry
## 1845-Polk 1334 5193 153 1845 Polk James Knox
## 1849-Taylor 496 1179 22 1849 Taylor Zachary
## 1853-Pierce 1165 3641 104 1853 Pierce Franklin
## 1857-Buchanan 945 3086 89 1857 Buchanan James
## 1861-Lincoln 1075 4006 135 1861 Lincoln Abraham
## 1865-Lincoln 360 776 26 1865 Lincoln Abraham
## 1869-Grant 485 1235 40 1869 Grant Ulysses S.
## 1873-Grant 552 1475 43 1873 Grant Ulysses S.
## 1877-Hayes 831 2716 59 1877 Hayes Rutherford B.
## 1881-Garfield 1021 3212 111 1881 Garfield James A.
## 1885-Cleveland 676 1820 44 1885 Cleveland Grover
## 1889-Harrison 1352 4722 157 1889 Harrison Benjamin
## 1893-Cleveland 821 2125 58 1893 Cleveland Grover
## 1897-McKinley 1232 4361 130 1897 McKinley William
## 1901-McKinley 854 2437 100 1901 McKinley William
## 1905-Roosevelt 404 1079 33 1905 Roosevelt Theodore
## 1909-Taft 1437 5822 159 1909 Taft William Howard
## 1913-Wilson 658 1882 68 1913 Wilson Woodrow
## 1917-Wilson 549 1656 59 1917 Wilson Woodrow
## 1921-Harding 1169 3721 148 1921 Harding Warren G.
## 1925-Coolidge 1220 4440 196 1925 Coolidge Calvin
## 1929-Hoover 1090 3865 158 1929 Hoover Herbert
## 1933-Roosevelt 743 2062 85 1933 Roosevelt Franklin D.
## 1937-Roosevelt 725 1997 96 1937 Roosevelt Franklin D.
## 1941-Roosevelt 526 1544 68 1941 Roosevelt Franklin D.
## 1945-Roosevelt 275 647 26 1945 Roosevelt Franklin D.
## 1949-Truman 781 2513 116 1949 Truman Harry S.
## 1953-Eisenhower 900 2757 119 1953 Eisenhower Dwight D.
## 1957-Eisenhower 621 1931 92 1957 Eisenhower Dwight D.
## 1961-Kennedy 566 1566 52 1961 Kennedy John F.
## 1965-Johnson 568 1723 93 1965 Johnson Lyndon Baines
## 1969-Nixon 743 2437 103 1969 Nixon Richard Milhous
## 1973-Nixon 544 2012 68 1973 Nixon Richard Milhous
## 1977-Carter 527 1376 52 1977 Carter Jimmy
## 1981-Reagan 902 2790 128 1981 Reagan Ronald
## 1985-Reagan 925 2921 123 1985 Reagan Ronald
## 1989-Bush 795 2681 141 1989 Bush George
## 1993-Clinton 642 1833 81 1993 Clinton Bill
## 1997-Clinton 773 2449 111 1997 Clinton Bill
## 2001-Bush 621 1808 97 2001 Bush George W.
## 2005-Bush 773 2319 100 2005 Bush George W.
## 2009-Obama 938 2711 110 2009 Obama Barack
## 2013-Obama 814 2317 88 2013 Obama Barack
## 2017-Trump 582 1660 88 2017 Trump Donald J.
##
## Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
## Created: Tue Jun 13 14:51:47 2017
## Notes: http://www.presidency.ucsb.edu/inaugurals.php
# let's look at 20th- and 21st-century presidents (Year > 1901, so starting with 1905)
inaug.dfm <- dfm(corpus_subset(data_corpus_inaugural, Year>1901), remove_punct=TRUE)
For Wordscores, we need to specify at least two anchors that will form the end points of our single-dimension reduction of the data. Let’s start with Roosevelt’s 1937 speech as our left-most point and Reagan’s first speech as our right-most.
docnames(inaug.dfm)
## [1] "1905-Roosevelt" "1909-Taft" "1913-Wilson"
## [4] "1917-Wilson" "1921-Harding" "1925-Coolidge"
## [7] "1929-Hoover" "1933-Roosevelt" "1937-Roosevelt"
## [10] "1941-Roosevelt" "1945-Roosevelt" "1949-Truman"
## [13] "1953-Eisenhower" "1957-Eisenhower" "1961-Kennedy"
## [16] "1965-Johnson" "1969-Nixon" "1973-Nixon"
## [19] "1977-Carter" "1981-Reagan" "1985-Reagan"
## [22] "1989-Bush" "1993-Clinton" "1997-Clinton"
## [25] "2001-Bush" "2005-Bush" "2009-Obama"
## [28] "2013-Obama" "2017-Trump"
# so, we want the 9th document and the 20th
# scores tells wordscores the end points of the scale
fit <- textmodel_wordscores(inaug.dfm[c(9,20)],y=c(-1,1))
# Then we can apply our fitted model to the full data set
word.scores <- predict(fit, newdata=inaug.dfm)
# see below regarding the "@" symbol
ws.data.frame <- word.scores@textscores
ggplot(ws.data.frame, aes(x=rownames(ws.data.frame), y=textscore_raw)) +
geom_errorbar(aes(ymin=textscore_raw_lo, ymax=textscore_raw_hi), width=.1) +
geom_point() +
theme(axis.text.x=element_text(angle=45, hjust=1))
Note the message from predict that only a subset of the features can be scored. This means that only that subset of words appears in one of the training documents; the remaining words in the feature set appear in other documents, but not in the two we train on.
It’s likely that Wordscores will put a lot of emphasis on words that appear in one of the training docs but not in the other. In our case, Roosevelt’s speech has 683 distinct words and Reagan’s has 843, but they share only 256 of the 1,270 distinct words between them.
# distinct words in Roosevelt
sum(as.numeric(inaug.dfm[9,])>0)
## [1] 683
# distinct words in Reagan
sum(as.numeric(inaug.dfm[20,])>0)
## [1] 843
# distinct words in both
sum(as.numeric(inaug.dfm[20,])>0 & as.numeric(inaug.dfm[9,])>0)
## [1] 256
# distinct words in either
sum(as.numeric(inaug.dfm[20,])>0 | as.numeric(inaug.dfm[9,])>0)
## [1] 1270
Above, we accessed the results from Wordscores using the @ symbol instead of the usual $ (in word.scores@textscores). Many R objects are built with S3, which mimics some of the properties of an object-oriented class system. When fuller class functionality is required, objects are instead built in S4 (deployed through the methods package). S3 objects (which include lists and data.frames) use the $ notation, while S4 objects use @. If you’re interested in the differences between the two approaches, this presentation by Friedrich Leisch offers a good orientation.
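To make the distinction concrete, here is a minimal sketch with two toy classes (the class and slot names here are made up for illustration):

```r
library(methods)

# S3: just a list with a class attribute; components accessed with $
s3.obj <- list(textscores = data.frame(score = c(-1, 0, 1)))
class(s3.obj) <- "myS3Model"
s3.scores <- s3.obj$textscores

# S4: a formally defined class; slots accessed with @
setClass("myS4Model", representation(textscores = "data.frame"))
s4.obj <- new("myS4Model", textscores = data.frame(score = c(-1, 0, 1)))
s4.scores <- s4.obj@textscores
# s4.obj$textscores   # would error: $ does not work on S4 slots
```

Trying the wrong accessor is a common stumbling block: $ on an S4 object throws an error rather than returning NULL, which is the usual first clue that you’re dealing with an S4 object.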
Let’s see what happens if we pin down the speeches with other end points. Maybe there was something idiosyncratic about this pair.
This is going to be easier to try out different endpoints if we wrap the call in a function.
wordScoreMe <- function(df, indices, y=c(-1,1)) {
fit <- textmodel_wordscores(df[indices,], y=y)
# Then we can apply our algorithm to the full data set
word.scores <- predict(fit, newdata=df)
ws.data.frame <- word.scores@textscores
ggplot(ws.data.frame, aes(x=rownames(ws.data.frame), y=textscore_raw)) +
geom_errorbar(aes(ymin=textscore_raw_lo, ymax=textscore_raw_hi), width=.1) +
geom_point() +
theme(axis.text.x=element_text(angle=45, hjust=1))
}
And now we can easily produce the scaled graph for different end points.
# how about Carter and Reagan's first speech?
wordScoreMe(inaug.dfm, c(19,20))
# Reagan's second speech and Obama's first?
wordScoreMe(inaug.dfm, c(27,21))
Each time, we see the same pattern: our two specified end points (training documents) are placed at the extremes, and everything else clusters close to zero. Two things are driving this, and they point to the sorts of corpora spatial models will work better with.
First, these documents are relatively short for each author, so the training pair shares only a small set of words. This gives the model little to work with in pulling out a dimension of difference, and it will tend to emphasize the small set of things that distinguish the particular pair (or, in the case of Wordfish, a single outlier, as we’ll see below). What Wordscores is essentially doing is ranking documents by how similar they are to the training documents at either end of the spectrum. Try it out with a larger dataset.
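To see what the scoring actually computes, here is a toy base-R sketch of the Wordscores logic (Laver, Benoit, and Garry 2003) with made-up counts. This illustrates the algorithm, not quanteda’s exact implementation:

```r
# two reference documents with anchor scores -1 and +1 (made-up counts)
refs <- rbind(left  = c(the = 10, tax = 5, cut = 0, spend = 5),
              right = c(the = 10, tax = 5, cut = 5, spend = 0))
ref.scores <- c(left = -1, right = 1)

# P(reference | word): relative frequency of each word within each reference,
# normalized across the references
rel <- refs / rowSums(refs)
p.ref.given.word <- t(t(rel) / colSums(rel))

# each word's score is the expected anchor score given that word
word.scores <- colSums(p.ref.given.word * ref.scores)
# words used equally by both anchors score 0; "cut" scores 1, "spend" -1

# a "virgin" document is scored by the frequency-weighted mean of word scores
virgin <- c(the = 8, tax = 2, cut = 3, spend = 0)
doc.score <- sum((virgin / sum(virgin)) * word.scores)
doc.score  # about 0.23: leans toward the "right" reference
```

Notice that the shared high-frequency words ("the", "tax") score exactly zero and merely dilute the document score toward the center, while the words unique to one anchor do all the work. That is the pattern we see in the inaugural speeches.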
The other issue is that although these Presidents are all speaking at the same formal event, they are potentially talking about very different things. Maybe there’s not enough consistency in topical focus across the inauguration speeches to map them into a single dimension. If we have a very narrow scope of potential discourse, then we may be more likely to find a reliable mapping to a single dimension even without lots of words per author. We’ve seen one example of such a narrowly-circumscribed set of documents: the 2010 set of budget speeches by Irish parliamentary leaders used in Lowe and Benoit (2013).
summary(data_corpus_irishbudget2010)
## Corpus consisting of 14 documents:
##
## Text Types Tokens Sentences year debate
## 2010_BUDGET_01_Brian_Lenihan_FF 1953 8641 374 2010 BUDGET
## 2010_BUDGET_02_Richard_Bruton_FG 1040 4446 217 2010 BUDGET
## 2010_BUDGET_03_Joan_Burton_LAB 1624 6393 307 2010 BUDGET
## 2010_BUDGET_04_Arthur_Morgan_SF 1595 7107 343 2010 BUDGET
## 2010_BUDGET_05_Brian_Cowen_FF 1629 6599 250 2010 BUDGET
## 2010_BUDGET_06_Enda_Kenny_FG 1148 4232 153 2010 BUDGET
## 2010_BUDGET_07_Kieran_ODonnell_FG 678 2297 133 2010 BUDGET
## 2010_BUDGET_08_Eamon_Gilmore_LAB 1181 4177 201 2010 BUDGET
## 2010_BUDGET_09_Michael_Higgins_LAB 488 1286 44 2010 BUDGET
## 2010_BUDGET_10_Ruairi_Quinn_LAB 439 1284 59 2010 BUDGET
## 2010_BUDGET_11_John_Gormley_Green 401 1030 49 2010 BUDGET
## 2010_BUDGET_12_Eamon_Ryan_Green 510 1643 90 2010 BUDGET
## 2010_BUDGET_13_Ciaran_Cuffe_Green 442 1240 45 2010 BUDGET
## 2010_BUDGET_14_Caoimhghin_OCaolain_SF 1188 4044 176 2010 BUDGET
## number foren name party
## 01 Brian Lenihan FF
## 02 Richard Bruton FG
## 03 Joan Burton LAB
## 04 Arthur Morgan SF
## 05 Brian Cowen FF
## 06 Enda Kenny FG
## 07 Kieran ODonnell FG
## 08 Eamon Gilmore LAB
## 09 Michael Higgins LAB
## 10 Ruairi Quinn LAB
## 11 John Gormley Green
## 12 Eamon Ryan Green
## 13 Ciaran Cuffe Green
## 14 Caoimhghin OCaolain SF
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/* on x86_64 by kbenoit
## Created: Wed Jun 28 22:04:18 2017
## Notes:
# let's fix up the doc names so they'll fit on graphs
# (note: this reaches into the corpus internals; in newer versions of
# quanteda, use docvars(new.ie.corp, "party") etc. instead)
new.ie.corp <- data_corpus_irishbudget2010
docnames(new.ie.corp) <- paste0(new.ie.corp$documents$party, "_", new.ie.corp$documents$name)
ie.dfm <- dfm(new.ie.corp)
docnames(ie.dfm)
## [1] "FF_Lenihan" "FG_Bruton" "LAB_Burton" "SF_Morgan"
## [5] "FF_Cowen" "FG_Kenny" "FG_ODonnell" "LAB_Gilmore"
## [9] "LAB_Higgins" "LAB_Quinn" "Green_Gormley" "Green_Ryan"
## [13] "Green_Cuffe" "SF_OCaolain"
Let’s start by pinning down our dimension to the endpoints that Wordfish identified (in Lowe and Benoit (2013)): FF’s Cowen (document 5) and Labour’s Burton (document 3).
wordScoreMe(ie.dfm, c(3,5))
Okay, now we’re getting somewhere. The very narrow scope of a particular budget debate means that even with relatively short texts, we’re able to capture the split between government (FF and Green) and the opposition. As in their paper, though, Sinn Fein is quite close to the middle, despite having been human-coded as the most extreme.
How does it look if we pin down on the human-coded extremes?
# Sinn Fein's O'Caolain
wordScoreMe(ie.dfm, c(14,5))
# Sinn Fein's Morgan
wordScoreMe(ie.dfm, c(4,5))
Clearly choice of end points can make a large difference in the spatial positions of authors – great care needs to be taken to validate both the stability of the results and the interpretation that the dimension mapped represents some theoretical dimension of interest.
Wordfish is a fully unsupervised model for scaling a set of documents (or individuals) on a single dimension without having to specify what that dimension is. Since we know from Lowe and Benoit what Wordfish should produce with the Irish budget speeches, let’s start with them.
Rather than specifying the end points, we only need to specify the orientation – choosing one document to be “less than” another. textmodel_wordfish() (in quanteda) will default to the first and second document if we don’t specify a pair.
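Under the hood, Wordfish is a Poisson model of word counts: y_dw ~ Poisson(exp(alpha_d + psi_w + beta_w * theta_d)), where theta_d is the document position we care about, alpha_d is a document fixed effect (verbosity), psi_w a word fixed effect (overall frequency), and beta_w a word slope (how strongly the word discriminates between positions). A tiny simulation from that model, with made-up parameter values, shows the data-generating process:

```r
set.seed(42)
theta <- c(-1, 0, 1)          # document positions (the quantity of interest)
alpha <- c(0, 0.2, -0.1)      # document fixed effects (verbosity)
psi   <- c(2, 1, 0.5, 0.5)    # word fixed effects (overall frequency)
beta  <- c(0, 0, 1.5, -1.5)   # word slopes (discrimination)

# expected count for doc d, word w: exp(alpha_d + psi_w + beta_w * theta_d)
lambda <- exp(outer(alpha, psi, "+") + outer(theta, beta))
y <- matrix(rpois(length(lambda), lambda), nrow = length(theta))
```

Words 1 and 2 (beta = 0) are common but uninformative; words 3 and 4 pile up in the documents at opposite ends of theta. Estimation simply runs this in reverse, recovering theta (up to sign, which is why we only need to specify an orientation) from the observed counts.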
wf.fit <- textmodel_wordfish(ie.dfm)
summary(wf.fit)
## Call:
## textmodel_wordfish.dfm(x = ie.dfm)
##
## Estimated document positions:
## theta SE lower upper
## FF_Lenihan -1.8209409 0.02032313 -1.86077421 -1.78110753
## FG_Bruton 0.5932874 0.02818872 0.53803755 0.64853732
## LAB_Burton 1.1136883 0.01540288 1.08349862 1.14387791
## SF_Morgan 0.1219197 0.02846356 0.06613111 0.17770827
## FF_Cowen -1.7724139 0.02364057 -1.81874944 -1.72607841
## FG_Kenny 0.7145832 0.02650303 0.66263729 0.76652918
## FG_ODonnell 0.4844867 0.04171532 0.40272465 0.56624872
## LAB_Gilmore 0.5616606 0.02967426 0.50349904 0.61982214
## LAB_Higgins 0.9703181 0.03850628 0.89484574 1.04579036
## LAB_Quinn 0.9589299 0.03892461 0.88263771 1.03522218
## Green_Gormley -1.1807223 0.07221382 -1.32226144 -1.03918325
## Green_Ryan -0.1866513 0.06294146 -0.31001658 -0.06328606
## Green_Cuffe -0.7422019 0.07245392 -0.88421153 -0.60019219
## SF_OCaolain 0.1840564 0.03666329 0.11219639 0.25591648
We’ll have to pull out the elements for our plot one at a time. To make this easier to reproduce, let’s put together a function that goes from model to plot. To do so, we’ll extract the features we want into a data.frame using the S4 @ notation.
wordFishMe <- function(df) {
wf.fit <- textmodel_wordfish(df)
wf.df <- data.frame(name=wf.fit@docs,
theta=wf.fit@theta,
se=wf.fit@se.theta)
ggplot(wf.df, aes(x=name, y=theta)) +
geom_errorbar(aes(ymin=theta - 1.96 * se,
ymax=theta + 1.96 * se),
width=.1) +
geom_point() +
theme(axis.text.x=element_text(angle=45, hjust=1))
}
Before plotting, let’s inspect the word-level parameters from the model we fit above. (Note: the column in the data frame below is labeled psi, but it actually pulls wf.fit@beta, the word slopes that discriminate between positions; in Wordfish, psi denotes the word fixed effects.)
df <- data.frame(word = wf.fit@features, psi=wf.fit@beta)
head(arrange(df, psi), n=20)
## word psi
## 1 shortly -2.882330
## 2 innovation -2.852359
## 3 day-to-day -2.773505
## 4 summary -2.746085
## 5 stabilise -2.708261
## 6 sought -2.641552
## 7 arrangements -2.641552
## 8 intention -2.593868
## 9 innovative -2.550237
## 10 fallen -2.550237
## 11 retrofit -2.550237
## 12 base -2.492989
## 13 maintain -2.492989
## 14 electric -2.492989
## 15 vrt -2.492989
## 16 phase -2.453319
## 17 concern -2.415608
## 18 repayments -2.415608
## 19 125,000 -2.415608
## 20 treatment -2.415608
head(arrange(df, desc(psi)), n=20)
## word psi
## 1 sharing 4.741609
## 2 medicine 4.741609
## 3 gang 4.741609
## 4 bailout 4.366983
## 5 post 4.366983
## 6 graduates 3.912543
## 7 consensus 3.904842
## 8 one's 3.904842
## 9 apprentices 3.904842
## 10 soft 3.904842
## 11 lifting 3.904842
## 12 awful 3.904842
## 13 trolley 3.904842
## 14 game 3.904842
## 15 sunday 3.904842
## 16 borrowed 3.904842
## 17 echo 3.904842
## 18 men 3.641201
## 19 societies 3.547394
## 20 remarkable 3.292733
wordFishMe(ie.dfm)
Alright, so our results look like Lowe and Benoit’s, including placing Sinn Fein, human-coded as the most anti-budget party, in the very center of the scored dimension.
Let’s see if Wordfish does any better with our inauguration speeches than Wordscore did.
wordFishMe(inaug.dfm)
We’re able to pull out a lot more variation than we were with Wordscores. However, the meaning of the dimension isn’t entirely clear. There’s a distinct pre-war, inter-war, and post-war split, so one hypothesis might be that we are capturing a temporal dimension. The discontinuities also map to the shift to radio broadcasts with Roosevelt and then to television broadcasts with Kennedy, so that might be it as well.
Maybe if we narrow our scope to post-Eisenhower, we’ll be able to recover a political dimension?
inaug.post.war <- dfm(corpus_subset(data_corpus_inaugural, Year>1960))
wordFishMe(inaug.post.war)
Two things are worthy of note here. First, the pattern still appears to be primarily temporal, this time mapping to Cold-War and post-Cold-War inaugurations.
Interestingly, the relative pattern between the included presidents is similar to what we saw between them when fitting the full set. This may be coincidence, or it may indicate that we’re not pinning down what’s driving positioning within this dimension.
Rather than imposing a uni-dimensional spatial model on the data, correspondence analysis (CA) allows us to pull a multi-dimensional spatial model out of the texts. The ca package implements this approach. This lab will implement a simple two-dimensional CA; to find out more about running it with a larger number of spatial dimensions, check out the ca documentation. Conveniently, quanteda has a wrapper function for ca as well.
# set the number of dimensions (nd) to 2
wca <- textmodel_ca(inaug.dfm, nd = 2)
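At its core, CA is an SVD of the standardized residuals of the document-term contingency table. Here is a base-R sketch on a made-up 3x4 count matrix; this mirrors the method itself, not textmodel_ca’s exact internals:

```r
# a toy document-term count matrix (3 docs, 4 words)
N <- matrix(c(10, 5, 2, 1,
               4, 8, 6, 2,
               1, 3, 7, 9), nrow = 3, byrow = TRUE)

P  <- N / sum(N)                 # correspondence matrix
r  <- rowSums(P)                 # row (document) masses
cm <- colSums(P)                 # column (word) masses

# standardized residuals: departures from row/column independence
S <- (P - r %*% t(cm)) / sqrt(r %*% t(cm))

dec <- svd(S)
# standard row (document) coordinates, first two dimensions
row.coord <- (dec$u / sqrt(r))[, 1:2]
```

Each successive dimension captures the largest remaining departure from independence between documents and words, which is why the first CA dimension often resembles a Wordfish-style scaling, as we verify next.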
We can compare our first dimension to the dimension produced by Wordfish to see that the two are highly correlated.
wf.fit <- textmodel_wordfish(inaug.dfm)
plot.df <- data.frame(wordfish = wf.fit@theta,
ca.dim1 = wca$rowcoord[,1],
ca.dim2 = wca$rowcoord[,2],
name = wf.fit@docs)
ggplot(plot.df, aes(x=wordfish, y=ca.dim1, label=name)) +
geom_text(hjust=0) +
geom_point()
cor(wf.fit@theta, wca$rowcoord[,1])
## [1] 0.9363208
So, our first dimension in CA is mostly picking up the same dimension as Wordfish. What is our second dimension getting?
ggplot(plot.df, aes(x=ca.dim1, y=ca.dim2, label=name)) +
geom_text(hjust=0) +
geom_point()
Taftness and Hardingness. Let’s zoom in on the presidents who aren’t defining a dimension and see how Taftness and Hardingness map into what we know about them.
ggplot(plot.df, aes(x=ca.dim1, y=ca.dim2, label=name)) +
geom_text(hjust=0) +
geom_point() +
xlim(-1,1.1) + ylim(-0.8,0.8)
## Warning: Removed 4 rows containing missing values (geom_text).
## Warning: Removed 4 rows containing missing values (geom_point).
This draws on material prepared by Clark Bernier.